EDA Package Review

Introduction

This document evaluates the capabilities of a variety of R EDA packages. The goal is to select a subset to use consistently in future projects.

Data

This review uses the data Robbie provided for the mortgage lead score prediction algorithm.

## Observations: 45,562
## Variables: 11
## $ lead_created_date  <dttm> 2018-05-25 13:40:00, 2018-08-15 22:33:00, ...
## $ loan_purpose       <chr> "REFI", "REFI", "REFI", "Purchase", "Purcha...
## $ self_reported_fico <dbl> 700, 640, 750, 720, 640, 750, 720, 700, 680...
## $ loan_to_value      <dbl> 86.9, 91.1, 73.5, 84.6, 97.0, 97.5, 95.0, 9...
## $ loan_amount        <dbl> 152000, 410000, 238902, 330000, 80510, 3900...
## $ property_state     <chr> "FL", "CA", "NC", "CT", "GA", "IL", "CA", "...
## $ existing_customer  <chr> "Yes", "Unknown", "Unknown", "Yes", "Unknow...
## $ time_to_purchase   <chr> "N/A", "N/A", "N/A", "Unknown", "Unknown", ...
## $ eventually_fund    <chr> "No", "No", "No", "No", "No", "No", "No", "...
## $ lead_source_code   <chr> "Marketing-Email-Web", "Bankrate", "Creditq...
## $ time_lead_to_call  <dbl> 345, 994, 1323, 932, 874, 307, 377, 1212, 3...

Quick Aside

The code below was built using the original raw data. Nice code - it took me a while to work out how to split a long table into two side-by-side columns. This might be helpful in the future. (It seems pretty simple now - perhaps I should be embarrassed.)

Lead Source
Marketing-Email-Web
Bankrate
Creditqual-Email-Web
Web - Contact Us form
Zillow
Web - Rate Table
Paid-Search-Web
Informa-Phone
Third-Party-Web
Mobile - Contact Us form
LendingTree-WarmTransfer-Phone
Web - Payment calculator
Informa-Web
Quinstreet-Web
Ally Website-Phone
Web - Affordability calculator
Marketing Lead
Web - Refi calculator
Affinity-Dealer-Phone
Creditqual-Email-Phone
Paid-Search-Phone
MLS-Listing-Web
Dealer-Referral
MLS-Listing-Phone
LendingTree-RateTable
After-Hours-Call-Phone
Third-Party-Phone
Quinstreet
Mobile - Rate Table
Affinity-Ally-Employee-Phone
Affinity-Vendor-Phone
Affinity-Ally-Employee-Web
LendingTree-RateTable-Phone
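
A minimal sketch of one way to split a long one-column table into two side-by-side halves, assuming knitr and kableExtra; this is an illustration, not necessarily the original code, and `leads` is a placeholder name for the lead data frame:

```r
library(knitr)
library(kableExtra)

# Split a character vector into two side-by-side columns and render
# it as a single table (padding the right column if the count is odd).
split_table <- function(x) {
  half  <- ceiling(length(x) / 2)
  left  <- x[1:half]
  right <- c(x[(half + 1):length(x)], rep("", half * 2 - length(x)))
  kable(data.frame(`Lead Source` = left, `Lead Source` = right,
                   check.names = FALSE))
}

split_table(unique(leads$lead_source_code))
```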

EDA Package Review

The following R Packages are introduced:

  • DataExplorer
  • SmartEDA
  • funModeling
  • rpivotTable
  • XDA
  • skimr
  • PerformanceAnalytics
  • descriptr
  • inspectdf

Also included:

  • Miscellaneous EDA R code snippets
  • synthpop - generate anonymized data from existing data while maintaining data characteristics
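
A quick sketch of the synthpop workflow (an assumption of typical usage, not taken from this document's code; `leads` is a placeholder for the lead data):

```r
library(synthpop)

# syn() fits models to the original data and generates a synthetic
# copy that preserves its statistical characteristics.
syn_leads <- syn(leads)

# The synthetic data frame itself lives in the $syn element.
summary(syn_leads$syn)
```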

DataExplorer

The package intro can be found here. It is the only package (that I have found) that can create a complete EDA report with one function: create_report()!
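
A minimal sketch of that one-function report, assuming the lead data is in a data frame named `leads` (a placeholder name):

```r
library(DataExplorer)

# One call builds a complete HTML EDA report (structure, missingness,
# distributions, correlations) and writes it to the working directory.
create_report(leads, output_file = "leads_eda.html")
```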

plot_str creates an interactive plot displaying the structure of the data. It does not work interactively in R Markdown, so an image of the output is provided below.

## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu~ all_missing_col~
##   <int>   <int>            <int>            <int>            <int>
## 1 45562      11                7                4                0
## # ... with 4 more variables: total_missing_values <int>,
## #   complete_rows <int>, total_observations <int>, memory_usage <dbl>

It is easy to customize the plots using ggthemes. There are many to choose from:

##  [1] "theme_base"            "theme_calc"           
##  [3] "theme_clean"           "theme_economist"      
##  [5] "theme_economist_white" "theme_excel"          
##  [7] "theme_excel_new"       "theme_few"            
##  [9] "theme_fivethirtyeight" "theme_foundation"     
## [11] "theme_gdocs"           "theme_hc"             
## [13] "theme_igray"           "theme_map"            
## [15] "theme_pander"          "theme_par"            
## [17] "theme_solarized"       "theme_solarized_2"    
## [19] "theme_solid"           "theme_stata"          
## [21] "theme_tufte"           "theme_wsj"

You can find examples here.
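
A sketch of how a theme is applied - most of DataExplorer's plot_* functions accept a ggtheme argument (`leads` is a placeholder for the lead data):

```r
library(DataExplorer)
library(ggthemes)

# Default theme vs. a ggthemes customization of the same plot.
plot_bar(leads)
plot_bar(leads, ggtheme = theme_fivethirtyeight())
```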

Here is the default plot:

And here is a customized plot:

## 3 columns ignored with more than 10 categories.
## lead_created_date: 41813 categories
## property_state: 52 categories
## lead_source_code: 33 categories

DataExplorer can create an EDA report automatically - and it is pretty good! It is especially useful when you need results fast.

I encourage you to run the code below. (It creates many messages when building - messages I cannot silence - so the code chunk is not evaluated when knitr is run to produce the HTML output for this document.)

Overall impression - good package, competitive with other EDA package options.

SmartEDA

##                                 Descriptions       Obs
## 1                         Sample size (Nrow)     45562
## 2                    No. of Variables (Ncol)        11
## 3                   No. of Numeric Variables         4
## 4                    No. of Factor Variables         0
## 5                      No. of Text Variables         6
## 6                   No. of Logical Variables         0
## 7                      No. of Date Variables         2
## 8   No. of Zero variance Variables (Uniform)         0
## 9      %. of Variables having complete cases 100% (11)
## 10 %. of Variables having <50% missing cases    0% (0)
## 11 %. of Variables having >50% missing cases    0% (0)
## 12 %. of Variables having >90% missing cases    0% (0)
##    S.no       Variable Name  Variable Type % of Missing
## 1     1 lead_created_date** POSIXct:POSIXt            0
## 2     2       loan_purpose*      character            0
## 3     3  self_reported_fico        numeric            0
## 4     4       loan_to_value        numeric            0
## 5     5         loan_amount        numeric            0
## 6     6     property_state*      character            0
## 7     7  existing_customer*      character            0
## 8     8   time_to_purchase*      character            0
## 9     9    eventually_fund*      character            0
## 10   10   lead_source_code*      character            0
## 11   11   time_lead_to_call        numeric            0
##    No. of Unique values
## 1                 41813
## 2                     3
## 3                    13
## 4                  7392
## 5                  4386
## 6                    52
## 7                     3
## 8                     5
## 9                     2
## 10                   33
## 11                18078

ExpNumStat provides data overviews; by changing its parameters, different views are possible.
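
A sketch of the calls behind these views - the `by` and `gp` arguments control the grouping (`leads` is a placeholder for the lead data):

```r
library(SmartEDA)

# Overall summary of all numeric variables.
ExpNumStat(leads, by = "A", round = 2)

# The same statistics split by the target, plus the combined view.
ExpNumStat(leads, by = "G",  gp = "eventually_fund", round = 2)
ExpNumStat(leads, by = "GA", gp = "eventually_fund", round = 2)
```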

##                Vname Group    TN nNeg nZero  nPos NegInf PosInf NA_Value
## 3        loan_amount   All 45562    0     8 45531      0      0       23
## 2      loan_to_value   All 45562    0    17 45527      0      0       18
## 1 self_reported_fico   All 45562    0     0 45541      0      0       21
## 4  time_lead_to_call   All 45562    0     0 45562      0      0        0
##   Per_of_Missing         sum   min     max   mean   median       SD
## 3          0.050 13910384132   0.0 9900000 305461 210499.0 382271.3
## 2          0.040    13163643   0.0 8000000    289     86.1  37733.1
## 1          0.046    32377621 579.0     761    711    720.0     45.4
## 4          0.000   123178882   3.3  350416   2704    582.4  12476.5
##        CV      IQR Skweness  Kurtosis
## 3   1.251 230473.5    9.202   143.652
## 2 130.550     17.2  209.536 44364.599
## 1   0.064     70.0   -0.744    -0.786
## 4   4.615   1133.1   13.462   232.315
##                Vname               Group    TN nNeg nZero  nPos NegInf
## 3        loan_amount  eventually_fund:No 44887    0     8 44856      0
## 7        loan_amount eventually_fund:Yes   675    0     0   675      0
## 2      loan_to_value  eventually_fund:No 44887    0    17 44852      0
## 6      loan_to_value eventually_fund:Yes   675    0     0   675      0
## 1 self_reported_fico  eventually_fund:No 44887    0     0 44866      0
## 5 self_reported_fico eventually_fund:Yes   675    0     0   675      0
## 4  time_lead_to_call  eventually_fund:No 44887    0     0 44887      0
## 8  time_lead_to_call eventually_fund:Yes   675    0     0   675      0
##   PosInf NA_Value Per_of_Missing         sum    min     max     mean
## 3      0       23          0.051 13666715361   0.00 9900000 304625.4
## 7      0        0          0.000   243668771   1.00 2748000 360990.8
## 2      0       18          0.040    13112552   0.00 8000000    292.2
## 6      0        0          0.000       51091   9.85     100     75.7
## 1      0       21          0.047    31881212 579.00     761    710.6
## 5      0        0          0.000      496409 620.00     761    735.4
## 4      0        0          0.000   121774657   3.30  350416   2712.9
## 8      0        0          0.000     1404225   5.00  132230   2080.3
##     median       SD      CV      IQR Skweness Kurtosis
## 3 210000.0 383765.9   1.260 232000.0     9.22   143.58
## 7 290000.0 258848.2   0.717 324375.0     2.19    11.49
## 2     86.7  38015.8 130.084     17.1   207.98 43707.01
## 6     80.0     16.0   0.212     16.9    -1.08     1.16
## 1    720.0     45.5   0.064     70.0    -0.73    -0.81
## 5    750.0     28.4   0.039     30.0    -1.91     3.64
## 4    590.4  12533.6   4.620   1136.2    13.44   231.16
## 8    385.0   7770.6   3.735    929.6    10.90   147.67
##                 Vname               Group    TN nNeg nZero  nPos NegInf
## 3         loan_amount eventually_fund:All 45562    0     8 45531      0
## 7         loan_amount  eventually_fund:No 44887    0     8 44856      0
## 11        loan_amount eventually_fund:Yes   675    0     0   675      0
## 2       loan_to_value eventually_fund:All 45562    0    17 45527      0
## 6       loan_to_value  eventually_fund:No 44887    0    17 44852      0
## 10      loan_to_value eventually_fund:Yes   675    0     0   675      0
## 1  self_reported_fico eventually_fund:All 45562    0     0 45541      0
## 5  self_reported_fico  eventually_fund:No 44887    0     0 44866      0
## 9  self_reported_fico eventually_fund:Yes   675    0     0   675      0
## 4   time_lead_to_call eventually_fund:All 45562    0     0 45562      0
## 8   time_lead_to_call  eventually_fund:No 44887    0     0 44887      0
## 12  time_lead_to_call eventually_fund:Yes   675    0     0   675      0
##    PosInf NA_Value Per_of_Missing         sum    min     max     mean
## 3       0       23          0.050 13910384132   0.00 9900000 305460.9
## 7       0       23          0.051 13666715361   0.00 9900000 304625.4
## 11      0        0          0.000   243668771   1.00 2748000 360990.8
## 2       0       18          0.040    13163643   0.00 8000000    289.0
## 6       0       18          0.040    13112552   0.00 8000000    292.2
## 10      0        0          0.000       51091   9.85     100     75.7
## 1       0       21          0.046    32377621 579.00     761    711.0
## 5       0       21          0.047    31881212 579.00     761    710.6
## 9       0        0          0.000      496409 620.00     761    735.4
## 4       0        0          0.000   123178882   3.30  350416   2703.5
## 8       0        0          0.000   121774657   3.30  350416   2712.9
## 12      0        0          0.000     1404225   5.00  132230   2080.3
##      median       SD      CV      IQR Skweness  Kurtosis
## 3  210499.0 382271.3   1.251 230473.5    9.202   143.652
## 7  210000.0 383765.9   1.260 232000.0    9.225   143.581
## 11 290000.0 258848.2   0.717 324375.0    2.192    11.489
## 2      86.1  37733.1 130.550     17.2  209.536 44364.599
## 6      86.7  38015.8 130.084     17.1  207.978 43707.006
## 10     80.0     16.0   0.212     16.9   -1.084     1.159
## 1     720.0     45.4   0.064     70.0   -0.744    -0.786
## 5     720.0     45.5   0.064     70.0   -0.730    -0.810
## 9     750.0     28.4   0.039     30.0   -1.908     3.637
## 4     582.4  12476.5   4.615   1133.1   13.462   232.315
## 8     590.4  12533.6   4.620   1136.2   13.439   231.164
## 12    385.0   7770.6   3.735    929.6   10.904   147.666

Graphical representation of all numeric features - Density plot (Univariate)

Frequency tables for all categorical independent variables:

##             Variable       Valid Frequency Percent CumPercent
## 1       loan_purpose    Purchase     35680   78.31       78.3
## 2       loan_purpose        REFI      9862   21.65      100.0
## 3       loan_purpose     Unknown        20    0.04      100.0
## 4       loan_purpose       TOTAL     45562      NA         NA
## 5  existing_customer          No      8335   18.29       18.3
## 6  existing_customer     Unknown     13011   28.56       46.8
## 7  existing_customer         Yes     24216   53.15      100.0
## 8  existing_customer       TOTAL     45562      NA         NA
## 9   time_to_purchase Immediately      6541   14.36       14.4
## 10  time_to_purchase       Later      3353    7.36       21.7
## 11  time_to_purchase         N/A      9862   21.65       43.4
## 12  time_to_purchase        Soon      5567   12.22       55.6
## 13  time_to_purchase     Unknown     20239   44.42      100.0
## 14  time_to_purchase       TOTAL     45562      NA         NA
## 15   eventually_fund          No     44887   98.52       98.5
## 16   eventually_fund         Yes       675    1.48      100.0
## 17   eventually_fund       TOTAL     45562      NA         NA

Bar plots for all categorical variables - simple but well-formatted plots.

When the target is a continuous variable, more plots are available. (Better plots would be displayed if the data were different.)

And there is more to the package that I have skipped.

But before we leave this one:

A parallel coordinate plot is the equivalent of a spider chart, but with Cartesian coordinates, so it is often preferred.

To understand parallel coordinate plots, look here.
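
A sketch of SmartEDA's parallel coordinate plot, assuming the numeric lead features and a stratified sample per target group (`leads` and the sample sizes are placeholders):

```r
library(SmartEDA)

# Parallel coordinate plot of the numeric features, colored by the
# target; Stsize draws a stratified sample from each group.
ExpParcoord(leads, Group = "eventually_fund", Stsize = c(200, 200),
            Nvar = c("self_reported_fico", "loan_to_value",
                     "loan_amount", "time_lead_to_call"))
```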

Overall impression - Useful

funModeling

This package provides more functionality than illustrated below - gain and lift charts, for example! For these reasons, I suggest funModeling be part of your daily toolset.

Note: below, I occasionally used a different data set to illustrate some features more effectively.
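
The status and frequency tables that follow come from two core funModeling calls - a minimal sketch (`leads` is a placeholder for the lead data):

```r
library(funModeling)

# Zero, NA, and Inf counts plus type and cardinality per variable.
df_status(leads)

# Frequency tables (and bar plots) for every categorical variable.
freq(leads)
```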

##              variable q_zeros p_zeros q_na p_na q_inf p_inf           type
## 1   lead_created_date       0    0.00    0 0.00     0     0 POSIXct/POSIXt
## 2        loan_purpose       0    0.00    0 0.00     0     0      character
## 3  self_reported_fico       0    0.00   21 0.05     0     0        numeric
## 4       loan_to_value      17    0.04   18 0.04     0     0        numeric
## 5         loan_amount       8    0.02   23 0.05     0     0        numeric
## 6      property_state       0    0.00   40 0.09     0     0      character
## 7   existing_customer       0    0.00    0 0.00     0     0      character
## 8    time_to_purchase       0    0.00    0 0.00     0     0      character
## 9     eventually_fund       0    0.00    0 0.00     0     0      character
## 10   lead_source_code       0    0.00    0 0.00     0     0      character
## 11  time_lead_to_call       0    0.00    0 0.00     0     0        numeric
##    unique
## 1   41813
## 2       3
## 3      12
## 4    7391
## 5    4385
## 6      51
## 7       3
## 8       5
## 9       2
## 10     33
## 11  18078

##   loan_purpose frequency percentage cumulative_perc
## 1     Purchase     35680      78.31            78.3
## 2         REFI      9862      21.65           100.0
## 3      Unknown        20       0.04           100.0

##    property_state frequency percentage cumulative_perc
## 1              CA      4860      10.68            10.7
## 2              NY      3971       8.72            19.4
## 3              TX      3820       8.39            27.8
## 4              FL      3741       8.22            36.0
## 5              NJ      1936       4.25            40.3
## 6              IL      1897       4.17            44.4
## 7              PA      1861       4.09            48.5
## 8              GA      1684       3.70            52.2
## 9              NC      1571       3.45            55.7
## 10             OH      1478       3.25            58.9
## 11             MI      1276       2.80            61.7
## 12             MA      1164       2.56            64.3
## 13             VA      1158       2.54            66.8
## 14             MD       991       2.18            69.0
## 15             MO       917       2.01            71.0
## 16             WA       883       1.94            73.0
## 17             TN       876       1.92            74.9
## 18             AZ       872       1.92            76.8
## 19             CO       869       1.91            78.7
## 20             IN       769       1.69            80.4
## 21             SC       724       1.59            82.0
## 22             MN       693       1.52            83.5
## 23             WI       572       1.26            84.8
## 24             LA       537       1.18            85.9
## 25             NV       492       1.08            87.0
## 26             OR       490       1.08            88.1
## 27             CT       474       1.04            89.1
## 28             AL       462       1.01            90.2
## 29             AR       457       1.00            91.2
## 30             OK       375       0.82            92.0
## 31             KS       330       0.72            92.7
## 32             KY       324       0.71            93.4
## 33             UT       296       0.65            94.0
## 34             MS       273       0.60            94.7
## 35             IA       256       0.56            95.2
## 36             NE       246       0.54            95.8
## 37             WV       241       0.53            96.3
## 38             NH       232       0.51            96.8
## 39             NM       192       0.42            97.2
## 40             ME       177       0.39            97.6
## 41             ID       169       0.37            98.0
## 42             DC       163       0.36            98.3
## 43             DE       162       0.36            98.7
## 44             MT       130       0.29            99.0
## 45             RI       111       0.24            99.2
## 46             AK        76       0.17            99.4
## 47             ND        74       0.16            99.5
## 48             SD        73       0.16            99.7
## 49             VT        71       0.16            99.9
## 50             WY        53       0.12           100.0
## 51             HI         3       0.01           100.0

##   existing_customer frequency percentage cumulative_perc
## 1               Yes     24216       53.1            53.1
## 2           Unknown     13011       28.6            81.7
## 3                No      8335       18.3           100.0

##   time_to_purchase frequency percentage cumulative_perc
## 1          Unknown     20239      44.42            44.4
## 2              N/A      9862      21.65            66.1
## 3      Immediately      6541      14.36            80.4
## 4             Soon      5567      12.22            92.7
## 5            Later      3353       7.36           100.0

##   eventually_fund frequency percentage cumulative_perc
## 1              No     44887      98.52            98.5
## 2             Yes       675       1.48           100.0

##                  lead_source_code frequency percentage cumulative_perc
## 1           Web - Contact Us form     15753      34.57            34.6
## 2            Creditqual-Email-Web      8850      19.42            54.0
## 3                        Bankrate      6585      14.45            68.4
## 4                 Paid-Search-Web      5074      11.14            79.6
## 5             Marketing-Email-Web      3663       8.04            87.6
## 6                          Zillow      2225       4.88            92.5
## 7                Web - Rate Table       738       1.62            94.1
## 8                 Third-Party-Web       539       1.18            95.3
## 9        Web - Payment calculator       428       0.94            96.2
## 10                    Informa-Web       403       0.88            97.1
## 11             Ally Website-Phone       289       0.63            97.8
## 12                 Marketing Lead       280       0.61            98.4
## 13 Web - Affordability calculator       221       0.49            98.8
## 14          Web - Refi calculator       165       0.36            99.2
## 15                 Quinstreet-Web       124       0.27            99.5
## 16 LendingTree-WarmTransfer-Phone        79       0.17            99.7
## 17       Mobile - Contact Us form        41       0.09            99.7
## 18                MLS-Listing-Web        31       0.07            99.8
## 19          LendingTree-RateTable        12       0.03            99.8
## 20         After-Hours-Call-Phone         9       0.02            99.9
## 21         Creditqual-Email-Phone         9       0.02            99.9
## 22          Affinity-Dealer-Phone         8       0.02            99.9
## 23              Third-Party-Phone         8       0.02            99.9
## 24   Affinity-Ally-Employee-Phone         6       0.01            99.9
## 25              Paid-Search-Phone         5       0.01            99.9
## 26                  Informa-Phone         4       0.01           100.0
## 27            Mobile - Rate Table         4       0.01           100.0
## 28     Affinity-Ally-Employee-Web         2       0.00           100.0
## 29                Dealer-Referral         2       0.00           100.0
## 30    LendingTree-RateTable-Phone         2       0.00           100.0
## 31          Affinity-Vendor-Phone         1       0.00           100.0
## 32              MLS-Listing-Phone         1       0.00           100.0
## 33                     Quinstreet         1       0.00           100.0
## [1] "Variables processed: loan_purpose, property_state, existing_customer, time_to_purchase, eventually_fund, lead_source_code"
##    property_state frequency percentage cumulative_perc
## 1              CA      4860      10.67            10.7
## 2              NY      3971       8.72            19.4
## 3              TX      3820       8.38            27.8
## 4              FL      3741       8.21            36.0
## 5              NJ      1936       4.25            40.2
## 6              IL      1897       4.16            44.4
## 7              PA      1861       4.08            48.5
## 8              GA      1684       3.70            52.2
## 9              NC      1571       3.45            55.6
## 10             OH      1478       3.24            58.9

##    property_state2 frequency percentage cumulative_perc
## 1            other      5950      13.06            13.1
## 2               CA      4860      10.67            23.7
## 3               NY      3971       8.72            32.5
## 4               TX      3820       8.38            40.8
## 5               FL      3741       8.21            49.0
## 6               NJ      1936       4.25            53.3
## 7               IL      1897       4.16            57.5
## 8               PA      1861       4.08            61.5
## 9               GA      1684       3.70            65.2
## 10              NC      1571       3.45            68.7
## 11              OH      1478       3.24            71.9
## 12              MI      1276       2.80            74.7
## 13              MA      1164       2.55            77.3
## 14              VA      1158       2.54            79.8
## 15              MD       991       2.18            82.0
## 16              MO       917       2.01            84.0
## 17              WA       883       1.94            85.9
## 18              TN       876       1.92            87.9
## 19              AZ       872       1.91            89.8
## 20              CO       869       1.91            91.7
## 21              IN       769       1.69            93.4
## 22              SC       724       1.59            95.0
## 23              MN       693       1.52            96.5
## 24              WI       572       1.26            97.7
## 25              LA       537       1.18            98.9
## 26              NV       492       1.08           100.0
##                       var   en    mi       ig       gr
## 1  heart_disease_severity 1.85 0.995 0.995084 0.539066
## 2                    thal 2.03 0.209 0.209455 0.168046
## 3             exer_angina 1.77 0.139 0.139139 0.152639
## 4            exter_angina 1.77 0.139 0.139139 0.152639
## 5              chest_pain 2.53 0.205 0.205019 0.118029
## 6       num_vessels_flour 2.38 0.182 0.181522 0.115774
## 7                   slope 2.18 0.112 0.112422 0.086880
## 8       serum_cholestoral 7.48 0.561 0.560556 0.079556
## 9                  gender 1.84 0.057 0.057254 0.063297
## 10                oldpeak 4.87 0.249 0.249167 0.060358
## 11         max_heart_rate 6.83 0.334 0.333617 0.054070
## 12 resting_blood_pressure 5.57 0.143 0.142555 0.030239
## 13                    age 5.93 0.137 0.137175 0.027055
## 14        resting_electro 2.06 0.024 0.024148 0.022194
## 15    fasting_blood_sugar 1.60 0.000 0.000459 0.000758

## [0.0,0.2) [0.2,1.5) [1.5,6.2] 
##       106       107        90

Overall impression - best overall EDA package with functionality beyond EDA.

rpivotTable

Pretty cool - an interactive tool for precision/recall!
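
Launching the widget is a one-liner; the row/column arguments shown here are an assumed starting layout, and `leads` is a placeholder:

```r
library(rpivotTable)

# Opens an interactive pivot table; dimensions can then be dragged
# and dropped in the browser.
rpivotTable(leads, rows = "loan_purpose", cols = "eventually_fund",
            aggregatorName = "Count")
```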

Overall impression - useful in limited cases (even though it is pretty cool). Might be useful to include in a Shiny app.

XDA

##                        n   mean       sd     max   min   range nunique
## self_reported_fico 45541    711     45.4     761 579.0     182      13
## loan_to_value      45544    289  37733.1 8000000   0.0 8000000    7392
## loan_amount        45539 305461 382271.3 9900000   0.0 9900000    4386
## time_lead_to_call  45562   2704  12476.5  350416   3.3  350413   18078
##                    nzeros      iqr lowerbound upperbound noutlier
## self_reported_fico      0     70.0      575.0        855        0
## loan_to_value          17     17.2       51.9        121     3688
## loan_amount             8 230497.0  -216219.0     705746     2969
## time_lead_to_call       0   1133.1    -1411.4       3121     5797
##                     kurtosis skewness  mode miss  miss%      1%      5%
## self_reported_fico    -0.786   -0.743   750   21 0.0461   620.0   620.0
## loan_to_value      44362.651  209.529    80   18 0.0395    17.8    43.6
## loan_amount          143.645    9.201 60000   23 0.0505 60000.0 64000.0
## time_lead_to_call    232.304   13.462   246    0 0.0000   241.3   245.1
##                         25%      50%    75%      95%     99%
## self_reported_fico    680.0    720.0    750    761.0     761
## loan_to_value          77.8     86.1     95     99.4     100
## loan_amount        129526.5 210499.0 360000 792000.0 1500000
## time_lead_to_call     288.3    582.4   1421   7359.5   42506
##                       n miss  miss% unique
## loan_purpose      45562    0 0.0000      3
## property_state    45522   40 0.0878     52
## existing_customer 45562    0 0.0000      3
## time_to_purchase  45562    0 0.0000      5
## eventually_fund   45562    0 0.0000      2
## lead_source_code  45562    0 0.0000     33
## property_state2   45562    0 0.0000     26
##                                                                                                                        top5levels:count
## loan_purpose                                                                                      Purchase:35680, REFI:9862, Unknown:20
## property_state                                                                              CA:4860, NY:3971, TX:3820, FL:3741, NJ:1936
## existing_customer                                                                                     Yes:24216, Unknown:13011, No:8335
## time_to_purchase                                                       Unknown:20239, N/A:9862, Immediately:6541, Soon:5567, Later:3353
## eventually_fund                                                                                                       No:44887, Yes:675
## lead_source_code  Web - Contact Us form:15753, Creditqual-Email-Web:8850, Bankrate:6585, Paid-Search-Web:5074, Marketing-Email-Web:3663
## property_state2                                                                          other:5950, CA:4860, NY:3971, TX:3820, FL:3741

Overall impression - does not do anything better than the other packages reviewed.

skimr

## Skim summary statistics
##  n obs: 45562 
##  n variables: 12 
## 
## -- Variable type:character --------------------------------------------------------------------------------------------------------------------------------------------------
##           variable missing complete     n min max empty n_unique
##    eventually_fund       0    45562 45562   2   3     0        2
##  existing_customer       0    45562 45562   2   7     0        3
##   lead_source_code       0    45562 45562   6  30     0       33
##       loan_purpose       0    45562 45562   4   8     0        3
##     property_state      40    45522 45562   2   2     0       51
##    property_state2       0    45562 45562   2   5     0       26
##   time_to_purchase       0    45562 45562   3  11     0        5
## 
## -- Variable type:numeric ----------------------------------------------------------------------------------------------------------------------------------------------------
##            variable missing complete     n      mean        sd    p0
##         loan_amount      23    45539 45562 305460.9  382271.29   0  
##       loan_to_value      18    45544 45562    289.03  37733.07   0  
##  self_reported_fico      21    45541 45562    710.96     45.38 579  
##   time_lead_to_call       0    45562 45562   2703.54  12476.47   3.3
##        p25       p50      p75    p100     hist
##  129526.5  210499    360000   9900000 ▇▁▁▁▁▁▁▁
##      77.78     86.06     95   8000000 ▇▁▁▁▁▁▁▁
##     680       720       750       761 ▁▁▂▂▁▂▂▇
##     288.3     582.4    1421.4  350416 ▇▁▁▁▁▁▁▁
## 
## -- Variable type:POSIXct ----------------------------------------------------------------------------------------------------------------------------------------------------
##           variable missing complete     n        min        max     median
##  lead_created_date       0    45562 45562 2017-12-31 2018-10-22 2018-05-12
##  n_unique
##     41813

Overall impression - good when you need a compact view or summary statistics.
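
For reference, the compact summary above comes from a single call - a sketch with `leads` as a placeholder for the lead data:

```r
library(skimr)
library(dplyr)

# One call summarizes every variable, grouped by type.
skim(leads)

# skim() also plays well with dplyr grouping.
leads %>% group_by(eventually_fund) %>% skim()
```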

PerformanceAnalytics

PerformanceAnalytics is a library of functions designed for evaluating the performance and risk characteristics of financial assets or funds. It does not apply to more general use.

descriptr

This is one of the newest EDA packages. I was made aware of it on 3/5/19.

https://descriptr.rsquaredacademy.com/

  • Summary statistics
  • Two way tables
  • One way table
  • Group wise summary
  • Multiple variable statistics
  • Multiple one way tables
  • Multiple two way tables

I will concentrate on the visualizations supported in the package.

There is a Shiny version too! Launch it with ds_launch_shiny_app().

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

The ds_screener() function screens data frames and returns details such as variable names, classes, levels and missing values. The plot.screener() method creates bar plots to visualize the % of missing observations for each variable in a data frame.
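A minimal sketch of this screening workflow, using mtcarz (the factor-converted copy of mtcars that ships with descriptr):

```r
library(descriptr)

s <- ds_screener(mtcarz)  # column names, types, levels, missing counts
s                         # prints the screening table
plot(s)                   # bar plot of % missing per variable
```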

## -----------------------------------------------------------------------
## |  Column Name  |  Data Type  |  Levels   |  Missing  |  Missing (%)  |
## -----------------------------------------------------------------------
## |      mpg      |   numeric   |    NA     |     0     |       0       |
## |      cyl      |   factor    |   4 6 8   |     0     |       0       |
## |     disp      |   numeric   |    NA     |     0     |       0       |
## |      hp       |   numeric   |    NA     |     0     |       0       |
## |     drat      |   numeric   |    NA     |     0     |       0       |
## |      wt       |   numeric   |    NA     |     0     |       0       |
## |     qsec      |   numeric   |    NA     |     0     |       0       |
## |      vs       |   factor    |    0 1    |     0     |       0       |
## |      am       |   factor    |    0 1    |     0     |       0       |
## |     gear      |   factor    |   3 4 5   |     0     |       0       |
## |     carb      |   factor    |1 2 3 4 6 8|     0     |       0       |
## -----------------------------------------------------------------------
## 
##  Overall Missing Values           0 
##  Percentage of Missing Values     0 %
##  Rows with Missing Values         0 
##  Columns With Missing Values      0

Overall Impression:

inspectdf

Data

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   Gender = col_character(),
##   Grade = col_double(),
##   Horoscope = col_character(),
##   Subject = col_character(),
##   IntExt = col_character(),
##   OptPest = col_character(),
##   ScreenTime = col_double(),
##   Sleep = col_double(),
##   PhysActive = col_double(),
##   HrsHomework = col_double(),
##   SpendTime1 = col_character(),
##   SpendTime2 = col_character(),
##   Self1 = col_character(),
##   Self2 = col_character(),
##   Career = col_character(),
##   Superpower = col_character()
## )
## # A tibble: 10 x 17
##       ID Gender Grade Horoscope Subject IntExt OptPest ScreenTime Sleep
##    <dbl> <chr>  <dbl> <chr>     <chr>   <chr>  <chr>        <dbl> <dbl>
##  1     1 male       4 Scorpio   Math    Extra~ Optimi~          1     7
##  2     2 female     4 Capricorn Gym     Extra~ Optimi~          1     8
##  3     3 male       4 Taurus    Math    Intro~ Optimi~          4     9
##  4     4 male       4 Aquarius  Math    Don't~ Don't ~          3     9
##  5     5 male       4 Scorpio   Gym     Don't~ Don't ~          1     9
##  6     6 male       4 Pisces    Gym     Extra~ Optimi~          2     9
##  7     7 male       3 Scorpio   Art     Intro~ Optimi~          1    11
##  8     8 male       6 Taurus    Math    Extra~ Optimi~          4     9
##  9     9 male       6 Aries     Gym     Intro~ Pessim~          6     8
## 10    10 male       6 Pisces    Math    Intro~ Don't ~          3     9
## # ... with 8 more variables: PhysActive <dbl>, HrsHomework <dbl>,
## #   SpendTime1 <chr>, SpendTime2 <chr>, Self1 <chr>, Self2 <chr>,
## #   Career <chr>, Superpower <chr>
## [1] 185  17

We need three data frames: one with the complete data set, and two subsets to leverage the package's easy data-frame comparison features. We simply rename df to allGrades, then create the data frames oldGrades (grades 6-8) and youngGrades (grades 3-5).
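The split might look like the following (a sketch; the grade cutoffs follow the text above):

```r
library(dplyr)

allGrades   <- df                                 # complete data set
youngGrades <- filter(allGrades, Grade %in% 3:5)  # grades 3-5
oldGrades   <- filter(allGrades, Grade %in% 6:8)  # grades 6-8
```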

inspectdf Functions

Simply pass in a data frame, or two (for comparisons), and set show_plot = TRUE. The output includes both a tibble with the raw data and a visualization.
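A sketch of the calling pattern, using the data frames created above (the deprecation messages in the output below suggest piping to show_plot() rather than passing show_plot = TRUE):

```r
library(inspectdf)
library(magrittr)

inspect_types(allGrades) %>% show_plot()               # single data frame
inspect_types(youngGrades, oldGrades) %>% show_plot()  # compare two data frames
```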

inspect_types() function

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 2 x 4
##   type        cnt  pcnt col_name  
##   <chr>     <int> <dbl> <list>    
## 1 character    11  64.7 <chr [11]>
## 2 numeric       6  35.3 <chr [6]>

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 2 x 5
##   type      cnt_1 pcnt_1 cnt_2 pcnt_2
##   <chr>     <int>  <dbl> <int>  <dbl>
## 1 character    11   64.7    11   64.7
## 2 numeric       6   35.3     6   35.3

inspect_mem() function

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 17 x 3
##    col_name    size     pcnt
##    <chr>       <chr>   <dbl>
##  1 Career      8.04 Kb 12.8 
##  2 SpendTime2  7.88 Kb 12.5 
##  3 Superpower  7.8 Kb  12.4 
##  4 Self2       7.27 Kb 11.6 
##  5 Self1       7.21 Kb 11.5 
##  6 SpendTime1  6.84 Kb 10.9 
##  7 Horoscope   2.17 Kb  3.45
##  8 Subject     1.77 Kb  2.80
##  9 IntExt      1.68 Kb  2.67
## 10 OptPest     1.68 Kb  2.67
## 11 Gender      1.66 Kb  2.64
## 12 ID          1.49 Kb  2.37
## 13 Grade       1.49 Kb  2.37
## 14 ScreenTime  1.49 Kb  2.37
## 15 Sleep       1.49 Kb  2.37
## 16 PhysActive  1.49 Kb  2.37
## 17 HrsHomework 1.49 Kb  2.37

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 17 x 5
##    col_name    size_1    size_2    pcnt_1 pcnt_2
##    <chr>       <chr>     <chr>      <dbl>  <dbl>
##  1 SpendTime2  4.09 Kb   4.79 Kb    12.9   12.2 
##  2 Career      3.91 Kb   5.33 Kb    12.3   13.6 
##  3 Superpower  3.8 Kb    4.95 Kb    12.0   12.6 
##  4 Self1       3.65 Kb   4.51 Kb    11.5   11.5 
##  5 Self2       3.61 Kb   5.02 Kb    11.4   12.8 
##  6 SpendTime1  3.5 Kb    3.87 Kb    11.0    9.85
##  7 Horoscope   1.38 Kb   1.52 Kb     4.34   3.88
##  8 Subject     992 bytes 1.12 Kb     3.06   2.85
##  9 IntExt      904 bytes 1.03 Kb     2.78   2.63
## 10 OptPest     904 bytes 1.03 Kb     2.78   2.63
## 11 Gender      888 bytes 1.02 Kb     2.74   2.59
## 12 ID          712 bytes 864 bytes   2.19   2.15
## 13 Grade       712 bytes 864 bytes   2.19   2.15
## 14 ScreenTime  712 bytes 864 bytes   2.19   2.15
## 15 Sleep       712 bytes 864 bytes   2.19   2.15
## 16 PhysActive  712 bytes 864 bytes   2.19   2.15
## 17 HrsHomework 712 bytes 864 bytes   2.19   2.15

inspect_na() function

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 17 x 3
##    col_name      cnt  pcnt
##    <chr>       <dbl> <dbl>
##  1 Superpower      8 4.32 
##  2 Self1           2 1.08 
##  3 Career          2 1.08 
##  4 Subject         1 0.541
##  5 PhysActive      1 0.541
##  6 Self2           1 0.541
##  7 ID              0 0    
##  8 Gender          0 0    
##  9 Grade           0 0    
## 10 Horoscope       0 0    
## 11 IntExt          0 0    
## 12 OptPest         0 0    
## 13 ScreenTime      0 0    
## 14 Sleep           0 0    
## 15 HrsHomework     0 0    
## 16 SpendTime1      0 0    
## 17 SpendTime2      0 0

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 17 x 6
##    col_name    cnt_1 pcnt_1 cnt_2 pcnt_2 p_value
##    <chr>       <dbl>  <dbl> <dbl>  <dbl>   <dbl>
##  1 Superpower      3   3.61     5   4.90   0.948
##  2 Self1           2   2.41     0   0      0.389
##  3 Career          2   2.41     0   0      0.389
##  4 Subject         1   1.20     0   0      0.918
##  5 PhysActive      1   1.20     0   0      0.918
##  6 Self2           1   1.20     0   0      0.918
##  7 ID              0   0        0   0     NA    
##  8 Gender          0   0        0   0     NA    
##  9 Grade           0   0        0   0     NA    
## 10 Horoscope       0   0        0   0     NA    
## 11 IntExt          0   0        0   0     NA    
## 12 OptPest         0   0        0   0     NA    
## 13 ScreenTime      0   0        0   0     NA    
## 14 Sleep           0   0        0   0     NA    
## 15 HrsHomework     0   0        0   0     NA    
## 16 SpendTime1      0   0        0   0     NA    
## 17 SpendTime2      0   0        0   0     NA

inspect_num() function

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 6 x 10
##   col_name     min    q1 median  mean    q3   max    sd pcnt_na hist       
##   <chr>      <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <list>     
## 1 Grade          3     5      6  5.74     7     8  1.39   0     <tibble [2~
## 2 HrsHomewo~     0     1      3  4.17     6    35  4.79   0     <tibble [2~
## 3 ID             1    47     93 93      139   185 53.5    0     <tibble [2~
## 4 PhysActive     0     6      9 11.5     12    82 11.8    0.541 <tibble [1~
## 5 ScreenTime     0     1      3  3.00     4    18  2.32   0     <tibble [2~
## 6 Sleep          2     8      9  8.64    10    12  1.54   0     <tibble [2~

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 6 x 5
##   col_name    hist_1            hist_2               jsd  fisher_p
##   <chr>       <list>            <list>             <dbl>     <dbl>
## 1 Grade       <tibble [22 x 2]> <tibble [22 x 2]> 1      0.0001000
## 2 HrsHomework <tibble [23 x 2]> <tibble [23 x 2]> 0.0837 0.203    
## 3 ID          <tibble [20 x 2]> <tibble [20 x 2]> 0.471  0.0001000
## 4 PhysActive  <tibble [19 x 2]> <tibble [19 x 2]> 0.0626 0.189    
## 5 ScreenTime  <tibble [16 x 2]> <tibble [16 x 2]> 0.122  0.000300 
## 6 Sleep       <tibble [20 x 2]> <tibble [20 x 2]> 0.132  0.000800

inspect_imb() function to identify factors which might be overly prevalent.

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 11 x 4
##    col_name   value        pcnt   cnt
##    <chr>      <chr>       <dbl> <int>
##  1 OptPest    Optimist    53.0     98
##  2 IntExt     Extravert   51.4     95
##  3 Gender     female      49.7     92
##  4 Subject    Gym         37.3     69
##  5 Horoscope  Aries       11.4     21
##  6 SpendTime1 video games 11.4     21
##  7 Self1      funny        9.73    18
##  8 Career     lawyer       7.03    13
##  9 SpendTime2 reading      6.49    12
## 10 Superpower fly          5.95    11
## 11 Self2      active       4.86     9

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 11 x 7
##    col_name   value       pcnt_1 cnt_1 pcnt_2 cnt_2 p_value
##    <chr>      <chr>        <dbl> <int>  <dbl> <int>   <dbl>
##  1 OptPest    Optimist     54.2     45  52.0     53   0.875
##  2 IntExt     Extravert    51.8     43  51.0     52   1.000
##  3 Gender     female       49.4     41  50       51   1.000
##  4 Subject    Gym          38.6     32  36.3     37   0.868
##  5 Horoscope  Scorpio      15.7     13  NA       NA  NA    
##  6 SpendTime1 video games  15.7     13   7.84     8   0.151
##  7 Self1      funny        10.8      9   8.82     9   0.832
##  8 SpendTime2 reading       9.64     8  NA       NA  NA    
##  9 Self2      active        8.43     7  NA       NA  NA    
## 10 Superpower fly           8.43     7  NA       NA  NA    
## 11 Career     engineer      7.23     6  NA       NA  NA

inspect_cat() function

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 11 x 5
##    col_name     cnt common      common_pcnt levels            
##    <chr>      <int> <chr>             <dbl> <list>            
##  1 Career       107 lawyer             7.03 <tibble [107 x 3]>
##  2 Gender         3 female            49.7  <tibble [3 x 3]>  
##  3 Horoscope     12 Aries             11.4  <tibble [12 x 3]> 
##  4 IntExt         3 Extravert         51.4  <tibble [3 x 3]>  
##  5 OptPest        3 Optimist          53.0  <tibble [3 x 3]>  
##  6 Self1        100 funny              9.73 <tibble [100 x 3]>
##  7 Self2        102 active             4.86 <tibble [102 x 3]>
##  8 SpendTime1    90 video games       11.4  <tibble [90 x 3]> 
##  9 SpendTime2   107 reading            6.49 <tibble [107 x 3]>
## 10 Subject        6 Gym               37.3  <tibble [6 x 3]>  
## 11 Superpower    99 fly                5.95 <tibble [99 x 3]>

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 11 x 5
##    col_name        jsd fisher_p lvls_1            lvls_2           
##    <chr>         <dbl>    <dbl> <list>            <list>           
##  1 Career     0.591    0.262    <tibble [53 x 3]> <tibble [74 x 3]>
##  2 Gender     0.000102 1        <tibble [3 x 3]>  <tibble [3 x 3]> 
##  3 Horoscope  0.0590   0.193    <tibble [12 x 3]> <tibble [12 x 3]>
##  4 IntExt     0.0281   0.0300   <tibble [3 x 3]>  <tibble [3 x 3]> 
##  5 OptPest    0.000386 0.980    <tibble [3 x 3]>  <tibble [3 x 3]> 
##  6 Self1      0.602    0.0188   <tibble [52 x 3]> <tibble [64 x 3]>
##  7 Self2      0.582    0.0784   <tibble [52 x 3]> <tibble [73 x 3]>
##  8 SpendTime1 0.620    0.000400 <tibble [47 x 3]> <tibble [51 x 3]>
##  9 SpendTime2 0.656    0.0103   <tibble [57 x 3]> <tibble [66 x 3]>
## 10 Subject    0.00837  0.935    <tibble [6 x 3]>  <tibble [5 x 3]> 
## 11 Superpower 0.563    0.126    <tibble [50 x 3]> <tibble [65 x 3]>

inspect_cor() function

Why inspect_cor() over base cor()?

  • cor() requires numeric inputs only
  • Correlation matrices are hard to read
  • cor() doesn’t produce confidence intervals
  • cor() and cor.test() don’t provide visualisation methods out of the box
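A sketch of the calls, following the same pattern as the other inspect_* functions:

```r
library(inspectdf)
library(magrittr)

inspect_cor(allGrades) %>% show_plot()               # Pearson correlations with 95% CIs
inspect_cor(youngGrades, oldGrades) %>% show_plot()  # compare correlations across groups
```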
## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 15 x 6
##    col_1       col_2         corr  p_value   lower   upper
##    <chr>       <chr>        <dbl>    <dbl>   <dbl>   <dbl>
##  1 Sleep       ScreenTime -0.481  4.38e-12 -0.598  -0.343 
##  2 ScreenTime  Grade       0.413  5.19e- 9  0.266   0.541 
##  3 Sleep       Grade      -0.331  4.08e- 6 -0.470  -0.176 
##  4 HrsHomework ScreenTime  0.219  2.70e- 3  0.0568  0.371 
##  5 PhysActive  Grade      -0.192  8.97e- 3 -0.346  -0.0280
##  6 PhysActive  ScreenTime -0.176  1.71e- 2 -0.331  -0.0109
##  7 HrsHomework ID         -0.145  4.89e- 2 -0.302   0.0201
##  8 HrsHomework Grade       0.122  9.87e- 2 -0.0437  0.281 
##  9 HrsHomework PhysActive  0.115  1.21e- 1 -0.0513  0.275 
## 10 Sleep       ID         -0.0629 3.95e- 1 -0.225   0.103 
## 11 HrsHomework Sleep      -0.0586 4.28e- 1 -0.221   0.107 
## 12 ScreenTime  ID          0.0445 5.47e- 1 -0.121   0.208 
## 13 PhysActive  ID          0.0284 7.02e- 1 -0.137   0.193 
## 14 Grade       ID          0.0217 7.70e- 1 -0.143   0.186 
## 15 PhysActive  Sleep       0.0148 8.42e- 1 -0.151   0.179

Compare between youngGrades and oldGrades

## The `show_plot = TRUE` is deprecated and will be removed in a future version.  The `show_plot()` function should be used instead.  For more info, check out the help file ?show_plot()

## # A tibble: 15 x 5
##    col_1       col_2          corr_1  corr_2 p_value
##    <chr>       <chr>           <dbl>   <dbl>   <dbl>
##  1 ScreenTime  Grade       0.500      0.255   0.0550
##  2 Sleep       Grade      -0.322     -0.142   0.204 
##  3 Sleep       ScreenTime -0.291     -0.506   0.0869
##  4 HrsHomework PhysActive  0.268     -0.0336  0.0403
##  5 PhysActive  Grade      -0.224     -0.0621  0.272 
##  6 ScreenTime  ID          0.174     -0.0300  0.171 
##  7 PhysActive  ScreenTime -0.169     -0.185   0.911 
##  8 Grade       ID          0.163     -0.109   0.0685
##  9 HrsHomework ScreenTime  0.121      0.236   0.427 
## 10 HrsHomework Grade       0.0963     0.0887  0.959 
## 11 HrsHomework ID         -0.0569    -0.209   0.302 
## 12 PhysActive  Sleep      -0.0343    -0.0248  0.949 
## 13 HrsHomework Sleep      -0.0223    -0.0443  0.883 
## 14 Sleep       ID          0.00728   -0.0993  0.477 
## 15 PhysActive  ID          0.0000702  0.119   0.425

Overall Impression

I like this package because it’s got a lot of functionality and it’s incredibly straightforward. In short it allows you to understand and visualize column types, sizes, values, value imbalance & distributions as well as correlations. Further to this, it allows you to very easily perform any of the above features for an individual data frame, or to compare the differences between two data frames.

Miscellaneous Code

Imputed Data

MICE

MICE assumes that the missing data are Missing at Random (MAR), meaning the probability that a value is missing depends only on observed values and can be predicted using them. It imputes data on a variable-by-variable basis by specifying an imputation model per variable.

For example: Suppose we have X1, X2..Xk variables. If X1 has missing values, then it will be regressed on other variables X2 to Xk. The missing values in X1 will be then replaced by predictive values obtained. Similarly, if X2 has missing values, then X1, X3 to Xk variables will be used in prediction model as independent variables. Later, missing values will be replaced with predicted values.

By default, linear regression is used to predict continuous missing values, and logistic regression for categorical missing values. Once this cycle is complete, multiple data sets are generated; these data sets differ only in the imputed missing values. It is generally considered good practice to build models on these data sets separately and combine their results.
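A minimal mice sketch of this cycle (dat, y, x1 and x2 are hypothetical names):

```r
library(mice)

imp <- mice(dat, m = 5, seed = 123)  # generate 5 imputed data sets
fit <- with(imp, lm(y ~ x1 + x2))    # fit the model on each imputed set
summary(pool(fit))                   # pool the estimates across imputations
completed <- complete(imp, 1)        # extract the first completed data set
```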

Impute With missForest

missForest is an implementation of the random forest algorithm. It's a non-parametric imputation method applicable to various variable types. It builds a random forest model for each variable, then uses that model to predict the variable's missing values with the help of the observed values.

Since bagging works well on categorical variable too, we don’t need to remove them here.

It appears that character variables cause missForest to error.

missForest is not a good solution when a column has few distinct values.
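Given the notes above, a hedged sketch (dat is a hypothetical data frame; characters are converted to factors first to avoid the error mentioned):

```r
library(missForest)

# convert character columns to factors; missForest handles numeric and factor only
dat2 <- as.data.frame(lapply(dat, function(x) if (is.character(x)) as.factor(x) else x))

imp <- missForest(dat2)
head(imp$ximp)  # the imputed data frame
imp$OOBerror    # out-of-bag error: NRMSE (numeric) and PFC (factor)
```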

Hmisc

aregImpute() allows mean imputation using additive regression, bootstrapping, and predictive mean matching.

In bootstrapping, different bootstrap resamples are used for each of multiple imputations. Then, a flexible additive model (non parametric regression method) is fitted on samples taken with replacements from original data and missing values (acts as dependent variable) are predicted using non-missing values (independent variable).

Then, it uses predictive mean matching (the default) to impute missing values. Predictive mean matching works well for continuous and categorical (binary & multi-level) variables.
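A sketch of the aregImpute() workflow (the variable names are hypothetical):

```r
library(Hmisc)

imp <- aregImpute(~ age + income + sex, data = dat, n.impute = 5)
imp  # prints fit statistics for each imputed variable

# pull one completed data set out of the imputation object
imputed <- impute.transcan(imp, imputation = 1, data = dat,
                           list.out = TRUE, pr = FALSE, check = FALSE)
```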

Outlier Management

Change the declared y-value to the column of interest

Below, change the variable cutOff to meet your needs. cutOff determines the number of unique values needed in each numeric column to produce a boxplot.

Custom Outlier Function

Examine the top 25 rows of a column containing outliers. Change the column name to the column of interest.

Outlier Handling Options

The Tukey method formula: The bottom threshold is Q1 - 3(IQR); all values below are considered outliers. The top threshold is Q3 + 3(IQR); all values above are considered outliers.

The Hampel method formula: The bottom threshold is median_value - 3(mad_value); all values below are considered outliers. The top threshold is median_value + 3(mad_value); all values above are considered outliers.
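The two rules can be sketched as helper functions that return a logical outlier flag for a numeric vector (Tukey's fences being Q1 - k*IQR and Q3 + k*IQR):

```r
tukey_outliers <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

hampel_outliers <- function(x, k = 3) {
  abs(x - median(x, na.rm = TRUE)) > k * mad(x, na.rm = TRUE)
}
```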

Change columns to columns of interest

Synthetic Data Generation

synthpop

The ‘synthpop’ package is great for synthesizing data for statistical disclosure control or creating training data for model development. Other things to note,

  • Synthesizing a single table is fast and simple.
  • Watch out for over-fitting particularly with factors with many levels. Ensure the visit sequence is reasonable.
  • You are not constrained by only the supported methods, you can build your own!

Sample data

##      sex age          socprof income marital depress sport nofriend smoke
## 1 FEMALE  57          RETIRED    800 MARRIED       6    NO        6    NO
## 2   MALE  20 PUPIL OR STUDENT    350  SINGLE       0    NO        4    NO
## 3 FEMALE  18 PUPIL OR STUDENT     NA  SINGLE       0    NO       20    NO
## 4 FEMALE  78          RETIRED    900 WIDOWED      16   YES        0    NO
## 5 FEMALE  54    SELF-EMPLOYED   1500 MARRIED       4   YES        6   YES
## 6   MALE  20 PUPIL OR STUDENT     -8  SINGLE       5    NO       10    NO
##   nociga alcabuse  bmi
## 1     -8       NO 30.8
## 2     -8       NO 23.4
## 3     -8       NO 18.4
## 4     -8       NO 30.5
## 5     20       NO 20.0
## 6     -8       NO 23.9

The objective of synthesizing data is to generate a data set which resembles the original as closely as possible, warts and all, meaning also preserving the missing value structure. There are two ways to deal with missing values:

  1. impute/treat missing values before synthesis
  2. synthesize the missing values and deal with the missing later

The second option is generally better since the purpose the data is supporting may influence how the missing values are treated.

Missing values can be simply NA or some numeric code specified by the collection. A useful feature is that the syn() function allows for different NA types: for example, the income, nofriend and nociga variables use -8 as a missing value. A list is passed to the function in the following form.
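A sketch of that list, using the object name echoed in the syn() call later in this section:

```r
# -8 codes a missing value for these variables, alongside NA
cont.na.list <- list(income   = c(NA, -8),
                     nofriend = c(NA, -8),
                     nociga   = c(NA, -8))
# passed as syn(..., cont.na = cont.na.list)
```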

If this is not specified, the -8's will be treated as numeric values and may distort the synthesis. After synthesis, there is often a need to post-process the data to ensure it is logically consistent. For example, anyone who is married must be over 18, and anyone who doesn't smoke shouldn't have a value recorded for 'number of cigarettes consumed'. These rules can be applied during synthesis rather than needing ad hoc post-processing.

The variables appearing in a rule's condition need to be synthesized before the rule is applied, otherwise the function will throw an error. In this case age should be synthesized before marital, and smoke before nociga.
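A hedged sketch of what those rules could look like (the object names match the call echoed in the output; the specific conditions and replacement values are assumptions):

```r
rules.list       <- list(marital = "age < 18",       # under-18s cannot be married
                         nociga  = "smoke == 'NO'")  # non-smokers get no cigarette count
rules.value.list <- list(marital = "SINGLE",
                         nociga  = -8)
# passed as syn(..., rules = rules.list, rvalues = rules.value.list)
```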

There is one person with a bmi of 450.

##         sex age agegr            placesize    region       edu
## 3953 FEMALE  58 45-59 URBAN 20,000-100,000 Lubelskie SECONDARY
##                         eduspec                 socprof unempdur income
## 3953 economy and administration LONG-TERM SICK/DISABLED       -8   1300
##      marital mmarr ymarr msepdiv ysepdiv               ls depress
## 3953 MARRIED     4  1982      NA      NA MOSTLY SATISFIED       1
##                         trust trustfam trustneigh sport nofriend smoke
## 3953 ONE CAN`T BE TOO CAREFUL      YES        YES   YES        2    NO
##      nociga alcabuse alcsol workab wkabdur wkabint wkabintdur emcc englang
## 3953     -8       NO     NO     NO      -8      NO       <NA> <NA>    NONE
##      height weight bmi
## 3953    149     NA 450

Their weight is missing from the data set, and would need to be present for this bmi to be accurate. I don't believe this value is correct! So any bmi over 75 (which is still very high) will be treated as a missing value and corrected before synthesis.
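The correction itself is a one-liner:

```r
# treat implausible bmi values as missing before synthesis
original.df$bmi[which(original.df$bmi > 75)] <- NA
```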

The data can now be synthesized using the following code.

## 
## Unexpected values (not obeying the rules) found for variable(s): nociga.
## Rules have been applied but make sure they are correct.
## Synthesis
## -----------
##  sex age socprof income marital depress sport nofriend smoke nociga
##  alcabuse bmi
## Call:
## ($call) syn(data = original.df, rules = rules.list, rvalues = rules.value.list, 
##     cont.na = cont.na.list, seed = myseed)
## 
## Number of synthesised data sets: 
## ($m)  1 
## 
## First rows of synthesised data set: 
## ($syn)
##      sex age                 socprof income marital depress sport nofriend
## 1   MALE  50                  FARMER     -8 MARRIED       5   YES        4
## 2   MALE  29 LONG-TERM SICK/DISABLED    600  SINGLE       4   YES        3
## 3 FEMALE  24        PUPIL OR STUDENT     NA  SINGLE       2    NO        5
## 4   MALE  62                 RETIRED   3200 MARRIED      15   YES        5
## 5   MALE  84                 RETIRED   2121 WIDOWED      14   YES        1
## 6   MALE  75                 RETIRED     -8 MARRIED      NA   YES        4
##   smoke nociga alcabuse  bmi
## 1    NO     -8       NO 29.3
## 2   YES     15       NO 22.2
## 3    NO     -8      YES 19.7
## 4   YES     30       NO 23.2
## 5    NO     -8       NO 25.8
## 6    NO     -8       NO 27.7
## ...
## 
## Synthesising methods: 
## ($method)
##      sex      age  socprof   income  marital  depress    sport nofriend 
## "sample"   "cart"   "cart"   "cart"   "cart"   "cart"   "cart"   "cart" 
##    smoke   nociga alcabuse      bmi 
##   "cart"   "cart"   "cart"   "cart" 
## 
## Order of synthesis: 
## ($visit.sequence)
##      sex      age  socprof   income  marital  depress    sport nofriend 
##        1        2        3        4        5        6        7        8 
##    smoke   nociga alcabuse      bmi 
##        9       10       11       12 
## 
## Matrix of predictors: 
## ($predictor.matrix)
##          sex age socprof income marital depress sport nofriend smoke
## sex        0   0       0      0       0       0     0        0     0
## age        1   0       0      0       0       0     0        0     0
## socprof    1   1       0      0       0       0     0        0     0
## income     1   1       1      0       0       0     0        0     0
## marital    1   1       1      1       0       0     0        0     0
## depress    1   1       1      1       1       0     0        0     0
## sport      1   1       1      1       1       1     0        0     0
## nofriend   1   1       1      1       1       1     1        0     0
## smoke      1   1       1      1       1       1     1        1     0
## nociga     1   1       1      1       1       1     1        1     1
## alcabuse   1   1       1      1       1       1     1        1     1
## bmi        1   1       1      1       1       1     1        1     1
##          nociga alcabuse bmi
## sex           0        0   0
## age           0        0   0
## socprof       0        0   0
## income        0        0   0
## marital       0        0   0
## depress       0        0   0
## sport         0        0   0
## nofriend      0        0   0
## smoke         0        0   0
## nociga        0        0   0
## alcabuse      1        0   0
## bmi           1        1   0

The compare function allows for easy checking of the synthesized data.

Solid. The distributions are very well preserved. Did the rules work on the smoking variable?
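The cross-tabulation below can be produced from the synthesis object:

```r
# synth.obj$syn holds the synthesized data frame
with(synth.obj$syn, table(smoke, nociga))
```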

##      nociga
## smoke   -8    1    2    3    4    5    6    7    8    9   10   12   13
##   YES   17   15   18   16   10   82   34   13   22    2  323   24    2
##   NO  3692    0    0    0    0    0    0    0    0    0    0    0    0
##      nociga
## smoke   14   15   16   17   18   20   21   24   25   29   30   35   40
##   YES    5  144    1    1    9  431    1    1   48    1   44    2   33
##   NO     0    0    0    0    0    0    0    0    0    0    0    0    0
##      nociga
## smoke   50
##   YES    1
##   NO     0

They did. All non-smokers have missing values for the number of cigarettes consumed.

compare can also be used for model output checking. A logistic regression model will be fit to find the important predictors of depression. The depression variable ranges from 0-21. This will be converted to

  • 0-7 - no evidence of depression (0)
  • 8-21 - evidence of depression (1)

This split leaves 3822 (0)’s and 1089 (1)’s for modelling.
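Mirroring the call echoed in the output, the model is fit with synthpop's glm.synds() and checked with compare():

```r
library(synthpop)

model.fit <- glm.synds(ifelse(depress > 7, 1, 0) ~ sex + age + log(income) +
                         sport + nofriend + smoke + alcabuse + bmi,
                       family = "binomial", data = synth.obj)
compare(model.fit, original.df)  # contrast the synthetic-data fit with the observed data
```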

## Warning: Note that all these results depend on the synthesis model being correct.
## 
## Fit to synthetic data set with a single synthesis.
## Inference to coefficients and standard errors that
## would be obtained from the observed data.
## 
## Call:
## glm.synds(formula = ifelse(depress > 7, 1, 0) ~ sex + age + log(income) + 
##     sport + nofriend + smoke + alcabuse + bmi, family = "binomial", 
##     data = synth.obj)
## 
## Combined estimates:
##             xpct(Beta) xpct(se.Beta) xpct(z)       Pr(>|xpct(z)|)    
## (Intercept)   -0.37376       0.71378   -0.52               0.6005    
## sexFEMALE      0.26372       0.09979    2.64               0.0082 ** 
## age            0.09369       0.00385   24.36 < 0.0000000000000002 ***
## log(income)   -0.79442       0.08859   -8.97 < 0.0000000000000002 ***
## sportNO       -0.76773       0.12299   -6.24        0.00000000043 ***
## nofriend      -0.02465       0.00845   -2.92               0.0035 ** 
## smokeNO       -0.05681       0.11369   -0.50               0.6173    
## alcabuseNO    -0.46311       0.20668   -2.24               0.0250 *  
## bmi            0.01152       0.01061    1.09               0.2776    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call used to fit models to the data:
## glm.synds(formula = ifelse(depress > 7, 1, 0) ~ sex + age + log(income) + 
##     sport + nofriend + smoke + alcabuse + bmi, family = "binomial", 
##     data = synth.obj)
## 
## Differences between results based on synthetic and observed data:
##             Std. coef diff p value CI overlap
## (Intercept)         -0.360   0.719      0.908
## sexFEMALE           -0.666   0.505      0.830
## age                  1.835   0.067      0.532
## log(income)         -0.319   0.750      0.919
## sportNO             -0.690   0.490      0.824
## nofriend            -2.230   0.026      0.431
## smokeNO             -1.320   0.187      0.663
## alcabuseNO           1.309   0.191      0.666
## bmi                  0.314   0.754      0.920
## 
## Measures for one synthesis and 9 coefficients
## Mean confidence interval overlap:  0.744
## Mean absolute std. coef diff:  1
## Lack-of-fit: 14; p-value 0.123 for test that synthesis model is compatible 
## with a chi-squared test with 9 degrees of freedom
## 
## Confidence interval plot:

While the model needs more work, the same conclusions would be made from both the original and synthetic data set as can be seen from the confidence intervals. Occasionally there may be contradicting conclusions made about a variable, accepting it in the observed data but not in the synthetic data for example. This scenario could be corrected by using different synthesis methods or altering the visit sequence.

Preserving Count Data

Released population data are often counts of people in geographical areas broken down by demographic variables (age, sex, etc.). Some cells in these tables can be very small, e.g. fewer than 5 people, and for privacy reasons such cells are suppressed to protect people's identities. With synthetic data, suppression is not required since the table contains no real people, provided there is enough uncertainty in how the records are synthesized.

The existence of small cell counts raises a few questions:

  1. If very few records exist in a particular grouping (1-4 records in an area) can they be accurately simulated by synthpop?
  2. Is the structure of the count data preserved?

To test this, 200 areas will be simulated to replicate possible real-world scenarios. Area sizes are randomly allocated to ensure a good mix of large and small populations. Each population size is drawn from a Poisson distribution with mean \(\lambda\): for large areas, \(\lambda\) is drawn from a uniform distribution on the interval [20, 40]; for small areas, \(\lambda\) is set to 1.
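The area-size simulation described above can be sketched in a few lines of base R; `n_areas`, `is_large`, and `pop_size` are illustrative names, not taken from the original code.

```r
set.seed(42)

n_areas <- 200
# Randomly flag each area as large or small to mix population sizes
is_large <- sample(c(TRUE, FALSE), n_areas, replace = TRUE)

# Mean lambda: Uniform(20, 40) for large areas, fixed at 1 for small areas
lambda <- ifelse(is_large, runif(n_areas, min = 20, max = 40), 1)

# Draw each area's population from a Poisson distribution
pop_size <- rpois(n_areas, lambda)
```

Small areas then mostly end up with 0-4 records, exactly the range that would trigger cell suppression in a released table.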

The order in which variables are synthesized, and the choice of predictors, is important when there are rare events or low-sample areas. If such a variable is synthesized early in the procedure and used as a predictor for the following variables, the subsequent models are likely to over-fit. Synthetic data sets require a level of uncertainty to reduce the risk of statistical disclosure, so this is not ideal.

Fortunately syn allows the predictor matrix to be modified. To avoid over-fitting, area is the last variable to be synthesized and uses only sex and age as predictors, which is reasonable for capturing the key population characteristics. Additionally, syn throws an error unless maxfaclevels is increased to the number of areas (the default is 60). This guard exists to prevent poorly synthesized high-cardinality factors, and the accompanying warning message suggests checking the results, which is good practice.
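A minimal sketch of that modification is below. The data frame name `SD2011_area` (the survey variables plus the simulated `area` column) is an assumption; `syn`, `predictor.matrix`, and `maxfaclevels` are from the synthpop package, and the guard keeps the sketch runnable where the package or data are absent.

```r
vars <- c("sex", "age", "socprof", "income", "marital", "depress", "sport",
          "nofriend", "smoke", "nociga", "alcabuse", "bmi", "area")

# Default pattern: each variable is predicted by those synthesized before it
pred <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
pred[lower.tri(pred)] <- 1

# area is synthesized last and uses only sex and age as predictors
pred["area", ] <- 0
pred["area", c("sex", "age")] <- 1

if (requireNamespace("synthpop", quietly = TRUE) && exists("SD2011_area")) {
  synth.obj <- synthpop::syn(SD2011_area,
                             predictor.matrix = pred,
                             maxfaclevels = 200,  # default of 60 < 200 areas
                             seed = 1)
}
```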

## 
## Unexpected values (not obeying the rules) found for variable(s): nociga.
## Rules have been applied but make sure they are correct.

Changing the predictor matrix to predict area with only age and sex

##          sex age socprof income marital depress sport nofriend smoke
## sex        0   0       0      0       0       0     0        0     0
## age        1   0       0      0       0       0     0        0     0
## socprof    1   1       0      0       0       0     0        0     0
## income     1   1       1      0       0       0     0        0     0
## marital    1   1       1      1       0       0     0        0     0
## depress    1   1       1      1       1       0     0        0     0
## sport      1   1       1      1       1       1     0        0     0
## nofriend   1   1       1      1       1       1     1        0     0
## smoke      1   1       1      1       1       1     1        1     0
## nociga     1   1       1      1       1       1     1        1     1
## alcabuse   1   1       1      1       1       1     1        1     1
## bmi        1   1       1      1       1       1     1        1     1
## area       1   1       0      0       0       0     0        0     0
##          nociga alcabuse bmi area
## sex           0        0   0    0
## age           0        0   0    0
## socprof       0        0   0    0
## income        0        0   0    0
## marital       0        0   0    0
## depress       0        0   0    0
## sport         0        0   0    0
## nofriend      0        0   0    0
## smoke         0        0   0    0
## nociga        0        0   0    0
## alcabuse      1        0   0    0
## bmi           1        1   0    0
## area          0        0   0    0

Synthesizing New Predictor

## 
## Unexpected values (not obeying the rules) found for variable(s): nociga.
## Rules have been applied but make sure they are correct.
## Synthesis
## -----------
##  sex age socprof income marital depress sport nofriend smoke nociga
##  alcabuse bmi area

Compare Synthetic and Original Data

The area variable is simulated fairly well from only age and sex. The synthesis captures both the large and small areas, although the large areas are relatively more variable. This could use some fine tuning, but it will do for now.

##          sex
## area      MALE FEMALE
##   area1      3      0
##   area100   20     40
##   area101   23     33
##   area102   11     30
##   area104    9     18
##   area107   18     37
##   area108    6      8
##   area109    0      4
##   area110   32     41
##   area111   12     14
##   area112    6      3
##   area113   21     18
##   area114    0      2
##   area115    4      0
##   area117   11     33
##   area118    0      0
##   area12     0      0
##   area120   16     21
##   area121   16     30
##   area124   37     39

##          sex
## area      MALE FEMALE
##   area1      3      1
##   area100   26     30
##   area101   32     46
##   area102   21     39
##   area104   13     23
##   area107   25     36
##   area108    4      6
##   area109    1      2
##   area110   29     35
##   area111   28     26
##   area112    2      2
##   area113   17     18
##   area114    0      2
##   area115    1      1
##   area117   19     22
##   area118    1      0
##   area12     0      4
##   area120   20     28
##   area121   21     27
##   area124   31     32

The method does a good job of preserving the structure of the areas. How much variability is acceptable is up to the user and the intended purpose, and using more predictors may provide a better fit. The errors are distributed around zero, a good sign that the synthesis has not introduced bias into the data.
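The zero-centred-error check can be sketched as follows; `ods` and `synth` here are small toy data frames standing in for the observed data and the `$syn` component of the synthesis, so the numbers are illustrative only.

```r
set.seed(1)

# Toy stand-ins for the observed data and the synthetic data (synth.obj$syn)
ods   <- data.frame(area = sample(paste0("area", 1:5), 300, replace = TRUE),
                    sex  = sample(c("MALE", "FEMALE"), 300, replace = TRUE))
synth <- data.frame(area = sample(paste0("area", 1:5), 300, replace = TRUE),
                    sex  = sample(c("MALE", "FEMALE"), 300, replace = TRUE))

# Cell-by-cell count errors between the observed and synthetic tables
err <- table(ods$area, ods$sex) - table(synth$area, synth$sex)

# Both tables hold 300 records, so the errors sum to zero by construction;
# an unbiased synthesis should also scatter them evenly around zero
summary(as.vector(err))
```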

##                     sex
## marital              MALE FEMALE
##   SINGLE              653    532
##   MARRIED            1403   1606
##   WIDOWED              61    505
##   DIVORCED             56    153
##   LEGALLY SEPARATED    10      0
##   DE FACTO SEPARATED    5     13
##                     sex
## marital              MALE FEMALE
##   SINGLE              657    596
##   MARRIED            1382   1597
##   WIDOWED              66    465
##   DIVORCED             62    137
##   LEGALLY SEPARATED     6      1
##   DE FACTO SEPARATED    4     18

At higher levels of aggregation, the structure of the tables is better preserved.

This material was derived from here: synthpop.